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ABSTRACT 

The  problem  addressed  concerns  the  estimation  of  a  p-dimensional  multi¬ 
variate  density,  given  only  a  set  of  n  observation  vectors,  together  with 
information  that  the  density  function  is  likely  to  be  reasonably  smooth.  A 
solution  is  proposed  which  employs  up  to  n  +  h  p(p+l)  smoothing  parameters, 
all  of  which  may  be  estimated  by  their  posterior  means.  This  avoids  the  well- 
known  difficulties,  associated  with  even  one-dimensional  kernel  estimators,  of 
estimating  the  bandwidth  or  smoothing  parameter  by  a  mathematical  procedure. 

The  posterior  mean  value  function,  unconditional  upon  the  smoothing  parameters, 
turns  out  to  be  a  data-based  mixture  of  multivariate  t-distributions.  The 
corresponding  estimate  of  the  sailing  covariance  matrix  may  be  viewed  as  a 
shrinkage  estimator  of  the  Bayes-Stein  type.  The  results  involve  some  finite 
series  which  may  be  evaluated  by  a  straightforward  simulation  procedure. 

AMS(MOS)  Subject  Classifications:  62605,  62H12 

Key  Words:  multivariate  density  estimation,  Bayes,  Dirichlet,  Wishart, 

multivariate  t-distribution,  smoothing,  bandwidth,  covariance 
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SIGNIFICANCE  AND  EXPLANATION 

A  method  Is  proposed  for  the  smooth  estimation  of  a  multivariate 
density,  given  a  finite  number  of  observation  vectors.  The  main 
procedure  suggested  employs  a  generalized  kernel  estimator,  where 
the  nodes  are  selected  via  a  preliminary  cluster  analysis.  An  exchange¬ 
able  Dlrlchlet  prior  for  the  coefficients  is  used  together  with  an 
Inverted  Wlshart  prior  for  the  smoothing  parameters.  The  posterior 
means  of  these  quantities  are  obtained,  and  the  unconditional  posterior 
mean  value  function  of  the  multivariate  density  is  shown  to  be  a 
finite  mixture  of  multivariate  t-densities.  All  results  Involve  the 
summation  of  a  finite  series;  this  Is  best  evaluated  by  a  simulation 
technique. 
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BAYES  ESTIMATION  OF  A  MULTIVARIATE  DENSITY 


Tc-n  Leonard 
1 .  PRIOR  ASSUMPTIONS 

Suppose  that  the  observation  vectors  x.j,...,xn  are  independent,  given 
their  common  p-dimensional  multivariate  density  g(x);  x  e  Rp.  Generalized 
kernel  estimators  of  the  form 

r  _  r 

g*(x)  =  I  BA  Ax)  for  xeRp;  Z  0.  =  1  (1.1) 

0=1  J  J  ~  ~  0=1  3 

are  considered,  where 

d> j ( x)  =  (2Tr)_JsP|cr^  expI-Mx-CjjV^x-Cj)  (j«l ,r)  (1.2) 

The  assumptions  in  (1.1)  and  (1.2)  place  multivariate  normal  kernels  over 
the  r  points  and  then  estimate  g  by  a  weighted  average  of  these 

kernels,  with  weights  61,...,6r.  The  matrix  C"1  is  related  to  the  idea  of  band¬ 
width  for  one-dimensional  kernel  estimators;  its  specification  has  a  large 
effect  upon  the  estimation  of  g.  (e.g.  Silverman,  1978).  The  unequal 
weights  e,,...,6r,  for  the  points  contrast  the  equal  weights  stipulated 

by  kernel  estimators,  (for  the  n  kernels  centered  on  the  data  points 
x-j,...,xn).  They  are  designed  to  avoid  the  typical  practical  difficulties 
associated  with  equal  weights  e.g.  either  oversmoothing  or  overbumpy  tails 
depending  on  the  choice  of  bandwidth. 

A  number  of  further  assumptions  are  made  about  6j,...,6r;  C,  and 
, • • • They  are 
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(a)  The  proportions  8.j,...,8r  are  taken  to  possess  a  Dirlchlet  prior  distribution  f 


a^-1  an2-l  anr-l 
r 


ir(6|a,n)  *  81  82  ‘  ...  0 


(1.3) 


(0  <  a  <  »;  ) 


over  their  simplex  of  permissible  values;  here  n-|,...»nr  denote  the  respective 
prior  means  of  8-j,...,8r»  and  a  controls  the  degree  of  shrinkage  of  our 
posterior  estimates  towards  hj....,nr. 

The  following  procedures  are  available  for  choosing  n-j»...,nr  and  either 
choosing  or  empirically  estimating  a. 

t 

(I)  If  we  possess  prior  ignorance  about  8j,...,8r  we  could  set 

ct  *  r;  *  r"1,  yielding  a  uniform,  but  proper,  prior  distribution  » 

over  the  simplex. 

(II)  If  we  regard  8^,...,8r  as  exchangeable  then  we  could  set 

*  ...  ■  nr  *  r"1  and  then  estimate  a  from  the  data  by  either  a  hierarchical 
or  empirical  Bayes  method  (see  section  3).  This  yields  data  based  Bayes  -  Stein 
shrinkage  estimators  for  8i,...,8r;  it  is  our  main  suggestion.  The  shrinkage 
estimators  avoid  over  concentration  of  the  density  estimates  at  particular 
nodes  Cj. 

(III)  If  we  possess  prior  Information  to  suggest  suitable  weights  for 

the  r  kernels,  then  we  could  choose  nj,...,nr  accordingly,  and  then  estimate  t- 
o  from  the  data,  as  In  (11). 

t 
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(b)  The  matrix  C  Is  taken  to  possess  an  Inverted  Wlshart  prior  distribution 
ir(C|v,T)  *  c|T|^|cr%(v+p+1>exp{-Jstrace  CT)  (1.4) 

i 

where 

C-1  .  jvp/2  nP(p-l)/4  5  rfVJl!)  (1.5) 

J-l  2 

given  Its  parametric  matrix  T  and  degrees  of  freedom  v.  Here  (v-2)~^T 
Is  a  prior  estimate  of  C  and  v  Is  a  prior  'sample  size1  measuring  the 
strength  of  this  Information. 

The  following  choices  are  available  for  v  and  T. 

*** 

(I)  In  situations  where  there  Is  no  prior  information  about  C;  v  =  1 
and  T  equalling  the  n  *  n  Identity  matrix  are  reasonable 

choices. 

(II)  If  there  Is  related  Information  from  other  samples  then  this  might  also 
be  represented  by  the  specification  of  values  for  v  and  T. 

(III)  Possible  extensions  to  the  present  analysis  include  empirical  estimation 
of  v  when  T  Is  specified.  Also  T  could  be  assumed  to  take  the  In¬ 
traclass  form;  It  Is  then  possible  to  empirically  estimate  v  together 

withe  the  variance  and  correlation  appearing  In  the  special  covariance 
structure. 


(c)  Possible  empirical  choices  of  C-j.. ••»|r  involve 

(i)  Setting  r  =  n  and  5.  =  x.  for  j  =  1 . n;  a  frequent  choice  In  the  literature  of 

v  ““J 

kernel  estimation.  We,  of  course,  make  this  substitution  after  the  prior  to  posterior 
analysis  has  been  performed  for  fixed  5-j,...,£r.  However,  when  r  =  n,  our  procedure 
will  often  undersmooth;  therefore  choices  (11)  and  (Hi)  will  usually  be  preferable. 

(1 i )  Arranging  to  lie  on  an  equally  spaced  p-dimensional  lattice; 

the  choice  of  r  depending  upon  the  fineness  of  the  grid. 

(Hi)  Performing  a  preliminary  cluster  analysis,  and  then  putting  at  least 
one  £  in  the  center  of  each  cluster. 

(iv)  In  both  (11)  and  (ill)  suitable  values  of  r  may  be  obtained  by 
comparing  realizations  of  the  prior  predictive  density,  and  maximizing 
with  respect  to  r  (see  section  6). 

Of  the  above  possibilities  the  choice  a(1i),  b(i)  and  c(i 1 1 )  seems  to  be 
reasonable  if  the  objective  is  to  carry  out  a  data-based  analysis,  which  is 
virtually  free  from  choice  of  prior  parameter  values. 

Note  that  our  prior  assumptions  are  very  different  from  those  proposed 
by  Leonard  (1973,  1978,  1982)  who  uses  a  Gaussian  prl or/1 ogl Stic  transform 
approach  on  function  space.  Whilst  the  covariances  of  Gaussian  processes  are 
useful  In  one-dimension,  they  lead  to  overwhelmingly  complicated  computations 
In  several  dimensions  so  that  it  Is  necessary  to  decrease  the  complexity 
of  the  prior  structure  In  order  to  obtain  reasonable  results. 

Note  that  in  case  c(1 1 1 )  the  procedure  we  describe  is  equivalent  In  spirit  to  the 
data  analytic  procedure  of  assigning  proportions  to  each  cluster  In  accordance  with 
the  number  of  observations  In  the  cluster,  and  then  estimating  C  as  the  common  para¬ 
meter  matrix  In  the  corresponding  discrete  mixture  of  multivariate  normal  distributions. 
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2.  ESTIMATION  OF  6 

The  likelihood  of  0  and  C,  given  the  assumption 
observation  vectors,  is  denoted  by 

*(e,c|x(n))  «  n  z  BA  Ax*) 

1*1  j*l  3  3  -1 

ml  £  e«(i)#«(i)(5i) 

where  fl  is  the  space  of  all  possible  n«  1  vectors  w  * 
where,  for  i  =  l,...,n,  u(1)  is  a  mapping  w(1):  1  -*>  (1,. 
the  set  of  the  first  r  Integers.  Under  the  choices  in 
(2.1)  Implies  that 

t(e,C|x^)  «  ] C | e  n  exp{-»strace  C"1 
-  a  j-1  3 


where 


nj  =  #[1e(l,...,n):u>(1)«j] 


with  En.  -  n,  and 

v 


m 


2w =  if1  (?rlo(i)^5rlo(i)^ 


in  (1.2),  and  the  n 

(2.1) 

(u(l ),..., u»(n)), 
...r)  from  i  to 
(1.2)  for  the  4>j(x), 

V  (z*2> 

(2.3) 


(2.4) 
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Note  that  0  contains  rn  elements;  however  summations  with  respect 
to  uefi  may  be  evaluated  by  computer  simulations  generating  random  elements 
from  fl.  This  procedure  Is  straightforward,  the  same  set  of  vectors 
simulated  for  w  may  be  used  for  each  of  the  summations  Involved  in  this 
analysis. 

The  joint  posterior  distribution  of  6  .  and  C  is  now 

it(e,C|x^,a,n,v,T) 

*  ir(8|a,n)  ir ] C | v,T)  t(6,C|x^nh 

«  | C | V+P'*'n“1 )  z  JJ  e“o  nj  1  exp{->strace[C"1(T+D  )]}  (2.5) 

fl  j=l  J 

Integrating  with  respect  to  C,  we  find  that  the  posterior  density  of  e 

*** 

is  a  mixture  of  Dlrlchlet  densities, 

Tr(e|x(n),a,n)  -  E  Aw6(e|n)/EAu  (2.6) 


where 


r  a.+n4-l 

«(e)  »  k  (n)  n  e.J  J 
“  ”  j»l  J 


(2.7) 


with 


*«.<!!> 


r(o+n)/  n  r(aj+Oj) 
j 


(2.8) 
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and 

Au  ■  n^i 'Wv+")/k„(")  <*•»> 

Bayes  estimates  may  be  readily  elicited  from  (2.6),  for  example,  the 
posterior  mean  vector  of  6,  conditional  upon  a  and  n  Is  given  by 

E(0|x(n),a,n)  -  l  A  a  /IA  (2.10) 

ft  fi 

where 

*  (a1+n1,...,ar+nr)T/(a+n)  (2.11) 

Higher  moments  may  be  obtained  similarly  by  averaging  the  higher  moments 
of  the  Dlrlchlet  distribution  In  (2.7). 

3.  EMPIRICAL  ESTIMATION  OF  THE  SHRINKAGE  PARAMETER 

Integrating  the  joint  density  In  (2.5)  with  respect  to  both  6  and  C, 
gives,  with  the  judicious  Inclusion  of  an  extra  factor  In  the  proportionality 
constant  of  (2.5), 

//*(e,C|x^  ,o,n,v,T)d6dC 


*  l  Wa) 
8 


(2.12) 
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where 


r(a)nr(acj+nj) 

*  r(a+n)nr(as.)  (2.13) 

j  J 

and  A  is  defined  in  (2.9). 

U)  • 

The  expression  in  (2.12)  is,  as  a  function  of  a,  proportional  to  the 
marginal  density  of  x^,  given  a.  It  may  therefore  be  regarded  as  pro¬ 
portional  to  the  "marginal  likelihood"  of  a. 

The  posterior  density  of  a  is  therefore  given  by, 

ir(a|x^,n)  «  "(a)  E  A  B  (a)  (0  <  a  <  «)  (2.14) 

"  Q  W  W 

where  "(a)  represents  the  prior  density  of  a.  It  is  then  possible,  in 
principle,  to  calculate  the  posterior  mean  of  6,  unconditional  upon  a, 
from 

E(0|x*n,,n)  -  /  E(0|x(n^,o,n)n(a|x(nJ,n)da  (2.15) 

where  the  contributions  to  the  Integrand  of  (2.15)  are  given  in  (2.10)  and 
(2.4)  respectively. 

Some  approximations  lead  to  simpler  computations.  Leonard  (1977)  con 
slders  posterior  densities  of  the  form 

TT*(a|x^)  «  ir(a)Z  A  B  (a)  (2.16) 

ft 

and  shows  that,  with  t  ■  (l+a)/(n+a)  and 
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X2  =  E(n.-nn_.  )2/nn.  (2.17) 

i  1  1  1 

the  posterior  mean  of  x  is,  for  large  r,  approximated  by 

x*  =  max  (--a-—,! ) .  (2.18) 

X 

Therefore,  as  an  approximation,  the  corresponding  values  for  a  =  (nx-l)/(l-x) 
may  be  substituted  for  a  on  the  right  hand  side  of  (2.10).  This  works 
even  when  a  =  °°  since  then  a. /a  =  n..  More  precise  versions  of  x*,  in  terms  of 

J  J 

incomplete  Gamma  functions,  are  available. 

4.  ESTIMATION  OF  C. 

Integrating  the  joint  density  in  (2.5)  with  respect  to  6  we  find  that 
the  posterior  density  of  C  is 

tt (C | x)  «  |cf^v+p+n-1)  i  u(u>)exp{-»sC'1(T+D  )}  (4.1) 

~  ~  ~  n  “ 

where 

u(u>)  *  nr(ct.+n.)/r(a+n).  (4.2) 

j  j  j 

The  density  in  (4.1)  is  a  discrete  mixture  of  Wishart  densities.  The 
posterior  mean  matrix  of  C  is 
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E(C)x)  «  (v+n-2)'1  Z  u(w)(T+D  )/Z  u(w) 
ft  '  ft 

-  (v+n-2)'1!  +  (v+n-2)'1  Z  u(w)D, ,/Z  u(u»)  (4.3) 

ft  ""ft 

i 

Here  (v-2)_1J  Is  the  prior  estimate  and  the  second  term  on  the  right 
handslde  of  (4.3)  averages  the  matrix  In  (2. ‘4). 

5.  POSTERIOR  MEAN  VALUE  FUNCTION  OF  THE 
MULTIVARIATE  DENSITY 

The  posterior  mean  of  g*(x)  In  (1.1),  conditional  upon  the  hyper¬ 
parameters  a,  n,  v,  and  T  is 

E(8*(x)|x(n))  -  S  E(e.»  (x)|x(n)) 
j-1  3  3  '  ' 

where 

E<V,<«)|*(n)>  ■  exp(-S(;^)TC-' (x-Et) l«(n>) 

where  the  expectation  Is  with  respect  to  the  joint  distribution  In  (2.5).  After 
some  manipulation  we  find  that 

E(9*(x)|x(n)) 

*  I  W  “(“HI  *  S„  *  )Tr's(v+n"1)  (xeRn)  (5.1) 
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with  u(u)  defined  in  (4.2). 

The  weighted  average  of  multivariate  t-densities  in  (5.1)  is  a  Bayes 
estimate  of  g*(x)  under  both  component  wise  and  integrated  squared  error 
loss;  it  is  also  the  predictive  density  for  a  further  observation  vector  x. 

6.  MISCELLANEOUS  TOPICS 


The  posterior  means  of  the  central  moments  correspond  to  the  central 
moments  of  the  distribution  in  (5.1).  They  may  however  be  obtained  more 
directly  from  (1.1)  and  the  results  in  sections  2-4.  For  example,  the  mean 
vector  and  covariance  matrix  are  estimated  by 

y  «  E  CiE(e1|xtn)) 

~  j=l  J  3  ~ 


and 


Ei[E(ejC|x(n))  +  ij§jE(0j  |x(n))]  -  w1. 


The  covariance  matrix  E  may  be  compared  with  the  sample  covariance 
matrix  calculated  from  Xjt....xn.  It  will  smooth  the  sample  covariance  matrix 
by  shrinking  Its  elements  to  allow  for  our  various  sampling  and  prior  assumptions. 
When  both  a  and  C  are  estimated  empirically  E  will  take  the  form  of  a 
Bayes -Stein  estimate  because  the  amounts  of  shrinkage  depend  primarily  upon 
the  data. 
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Whilst  the  choice  r  »  n  is  sometimes  possible,  the  statistician  might 
prefer  to  work  with  a  fewer  number  of  5's  e.g.  with  p  x  2  he  might, 
for  n  around  200,  spread  the  5's  over  a  10  *  10  grid.  In  such  circumstances, 
we  may  compare  different  values  of  r  by  the  criterion 

log  p(x*n)|a,r)  =  log  2  A  B  (o)-Jjrh  log  it  (6.1) 

n 

where  A  and  are  defined  in  (2.9)  and  (2.13). 

hi  hi 

The  expression  in  (6.1)  is  the  log  of  the  (prior  predictive)  marginal 
density  of  x^ ,  given  r  and  the  hyperparameters.  If  several  different  values 
of  r  possess  equal  prior  probabilities,  then  maximizing  (6.1)  yields  the 
value  of  r  with  maximum  posterior  probability. 

Host  of  the  results  in  this  paper  have  been  stated  explicitly  in  terms 
of  one-dimensional  finite  series.  This  might  be  viewed  as  surprising  in  light 
of  the  apparent  complexity  of  the  problem  of  multivariate  density  estimation. 

The  evaluation  of  the  finite  series  via  computer  simulations  should  not  cause 
any  untoward  difficulltes.  Whilst  elements  u  of  a  should,  formally  speaking 
be  chosen  at  random  without  replacement,  the  chance  of  choosing  the  same  u 
twice  Is  Inflnitesmally  small,  therefore  the  simulations  may  quite 
reasonable  proceed  with  replacement. 
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The  problem  addressed  concerns  the  estimation  of  a  p-dlmenslonal  multi¬ 
variate  density,  given  only  a  set  of  n  observation  vectors,  together  with 
Information  that  the  density  function  Is  likely  to  be  reasonably  smooth.  A 
solution  Is  proposed  which  employs  up  to  n  +  %  p(p+l)  smoothing  parameters, 
all  of  which  may  be  estimated  by  their  posterior  means.  This  avoids  the  well- 
known  difficulties,  associated  with  even  one-dimensional  kernel  estimators,  of 
estimating  the  bandwidth  or  smoothing  parameter  by  a  mathematical  procedure. 

The  posterior  mean  value  function,  unconditional  upon  the  smoothing  parameters. 
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Abstract  (continued) 


urns  out  to  be  a  data-based  mixture  of  multivariate  t-dlstrlbutlons. 
The  corresponding  estimate  of  the  sampling  covariance  matrix  may  be 
viewed  as  a  shrinkage  estimator  of  the  Bayes-Steln  type.  The  results 
Involve  some  finite  series  which  may  be  evaluated  by  a  straightforward 
simulation  procedure. 


